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Abstract 


Blocked algorithms have much better properties of data locality and therefore can 
be much more efficient than ordinary algorithms when a memory hierarchy is in- 
volved. On the other hand, they are very difficult to write and to tune for particular 
machines. Here we consider the reorganization of nested loops through the use of 
known program transformations in order to create blocked algorithms automati- 
cally. The program transformations we use are strip mining, loop interchange, and 
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a variant of loop skewing in which we allow invertible linear transformations (with 
integer coordinates) of the loop indices. In this paper, we solve some problems 
concerning the optimal application of these transformations. We show, in a very 
general setting, how to choose a nearly optimal set of transformed indices. We then 
show, in one particular but rather frequently occurring situation, how to choose an 
optimal set of block sizes. 

Keywords: block algorithm, parallel computing, compiler optimization, 
matrix computation, numerical methods for partial differential equations, 
program transformation, memory hierarchy. 

Submitted to: Journal of Parallel and Distributed Computing . 


1 Introduction 


An essential fact of life in very-large-scale integrated circuits is that tran- 
sistors are cheap and wires are expensive. The concomitant fact in high- 
performance computing, especially parallel computing, is that computation 
is cheap and communication is expensive. The two types of communication 
that we are primarily concerned with here are communication between the 
processors in a multicomputer and communication between processors and 
main memory. 

Both these forms of communication are characterized by long latency 
and low bandwidth compared to the CPU rate. For instance, in the CRAY- 
1 memory was able to provide only 80 Mwords per second to the vector unit, 
which could produce one result and take in two operands per clock at 80 
MHz; thus the memory was too slow by a factor of three. This same phe- 
nomenon can be observed in the i860 RISC today, the NEC-SX supercom- 
puters, the Alliant machines, the CM-2, and most other high-performance 
machines. Communication speeds are likewise slower than processor speeds 
in multi computers such as the Intel iPSC/2. In that machine, processors 
communicate over 1-bit- wide channels but have full word- wide paths to local 
memory. While newer message passing machines will employ byte- wide com- 
munication channels, the evolving microprocessor already provides memory 
ports of 8 or 16 bytes. 

The principal architectural solution to these problems is to provide a 
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small but fast local memory at each processor. The memory may be man- 
aged by hardware on a demand basis (cache) or managed explicitly by soft- 
ware, either operating system or application. If the processor is B times 
faster than the data path to memory or to other processors, then it must 
make reference to that slow data path only once for every B operations 
in order not to be slowed down. For this to happen, it must get its data 
from the local memory roughly B — 1 times out of every B. Software must 
organize the computation so that this tt hit ratio” can be achieved. 


1.1 Block algorithms: Matrix Multiplication as an Example 

To achieve the necessary reuse of data in local memory, researchers have 
developed many new methods for computation involving matrices and other 
data arrays [5, 7, 16]. Typically an algorithm that refers to individual el- 
ements is replaced by one that operates on subarrays of data, which are 
called blocks in the matrix computing field. The operations on subarrays 
can be expressed in the usual way. The advantage of this approach is that 
the small blocks can be moved into the fast local memory and their elements 
can then be repeatedly used. 

The standard example is matrix multiplication. The usual program is 


for i = 1 to n do 
for j = 1 to n do 
for k = 1 to n do 

c[i,j] = c[i, j] + a[i,k] * b[k,j] ; 
od 
od 


od 


The entire computation involves 2 n 3 arithmetic operations (counting addi- 
tions and multiplications separately), but produces and consumes only 3 n 2 
data values. As a whole, the computation exhibits admirable reuse of data. 
In general, however, an entire matrix will not fit in a small local memory. 
The work must therefore be broken into small chunks of computation, each 
of which uses a small enough piece of the data. Note that for each iteration 
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of the outer loop ( e. , for a given value of i) n 2 operations are done and n 2 
data is referred to — no reuse. For fixed values of t and j , n computation 
and n data referred too — again, no reuse. 

Now consider a blocked matrix-multiply algorithm. 


for *0 = 1 to n step b do 
for j 0 = 1 to n step b do 
for kO = 1 to n step b do 

for i = t’O to min(iO + 6 — 1, n) do 
for j = jO to min(jO + b — 1, n) do 
for k = kO to min(fcO + b — 1, n) do 
c[i,j] = c[i,j ] + a[i,k] * b[k,j] ; 
od 
od 
od 
od 
od 
od 


First, note that in this program exactly the same operations are done on 
the sarnie data; even round-off error is identical. Only the sequence in which 
independent operations are performed is different from the unblocked pro- 
gram. There is still reuse in the whole program of order n. But if we consider 
one iteration with fixed tO, jO, and fcO, we see that 2 6 3 operations are per- 
formed (by the three inner loops) and 36 2 data are referred to. Now we can 
choose b small enough so that these 3 6 2 data will fit in the local memory and 
thus achieve b - fold reuse. (If this isn’t enough — if b < B in other words 
— then the machine is poorly designed and needs more local memory.) Put 
the other way, if we require J9-fold reuse, we choose the block size b = B. 

The subject of this paper is the automatic transformation of ordinary 
programs to blocked form. 

Our motivation for seeking such a capability is as follows. Many algo- 
rithms can be blocked. Indeed, recent work by numerical analysts has shown 
that the most important computations for dense matrices are blockable. A 
major software development of blocked algorithms for linear algebra has 
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been conducted as a result [6]. Further examples, in particular in the solu- 
tion of partial differential equations by difference and variational methods, 
are abundant. Indeed, many such codes have also been rewritten as block 
methods to better use the small main memory and large solid-state disk on 
Cray supercomputers [14]. All experience with these techniques has shown 
them to be enormously effective at squeezing the best possible performance 
out of advanced architectures. 

On the other hand, blocked code is much more difficult to write and to 
understand than point code. Writing it is a difficult and error-prone job. 
Blocking introduces block size parameters that have nothing to do with the 
problem being solved and which must be adjusted for each computer and 
each algorithm if good performance is to be achieved. Unfortunately, the 
alternative to having blocked code is worse: poor performance on impor- 
tant computations with the most powerful computers. For these reasons, 
Kennedy has stated that compiler management of memory hierarchies is 
the most important and most difficult task facing the writers of compilers 
for high-performance machines [12]. 


1.2 Program Transformation and Blocking; Previous Work 

We can view the reorganized matrix-multiply program in two ways. First, we 
can consider the matrices A, and C as j X j matrices whose elements are 
6x6 matrices. In this case, the inner three loops simply perform a multiply- 
add of one such block-element. This is the view taken by most numerical 
analysts. Second, we can derive the blocked program form the original, 
unblocked program by a sequence of standard program transformations. 
First, the individual loops are strip mined . For example, the loop 


for i = 1 to n do 
od 

is replaced by 


5 



for tO = 1 to n step 6 do 

for i = iO to min(iO + b — 1, n) do 

od od 


(Strip mining is a standard tool for dealing with vector registers. One may 
apply it “legally” to any loop. By legally, we mean that the transformed 
program computes the same result as before.) Strip mining, by itself, yields 
a six-loop program, but the order of the loops is not what is needed for a 
blocked algorithm. The second transformation we use is loop interchange . 
In general, this means changing the order of loops and hence the order in 
which computation is done. To block a program, we endeavor to move the 
strip loops (the iO, jO, and kO loops above) to the outside and the point loops 
(the i, j, and k loops above) to the inside. This interchange is what causes 
repeated references to the elements of small blocks. In the matrix-multiply 
example, the interchange is legal, but there are many interesting programs 
for which it is not, including LU and QR decompositions and methods for 
partial differential equations. 

This approach to automatic blocking, through loop strip mining and 
interchange, was first advocated by Wolfe [18]. It is derived from earlier 
work of Abu-Sufah, Kuck, and Lawrie on optimization in a paged virtual- 
memory system [lj. Wolfe introduced the term tiling. A tile is the collection 
of work to be done, i.e., the set of values of the point loop indices, for a fixed 
set of values of the block or outer loop indices. We like this terminology since 
it allows us to distinguish what we are doing — which is to decompose the 
work to be done into small subtasks (the tiles) — from the quite different 
task of decomposing the data a priori into small subarrays (the blocks), 
even though each tile does, in fact, act on blocks. Following Wolfe, we call 
the problem of decomposing the work of a loop nest index space tiling . 

Other authors have treated the issue of management of the memory 
hierarchy [8]. Some other treatments of the problem of automatic blocking 
have recently appeared [11], [4], [17], [18], [19]; none, however, gives the 
quantitative statments of the problem and the solution that we provide 
here. 
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1.3 Strip Mining and Loop Interchange Are Not Enough 
Consider the one- dimensional, discrete diffusion process 


for < = 0 to m do 

for i = 1 to n - 1 do 

«[*, *] = /(“[* - 1 , t - 1], «[i , t - 1], «[* + 1, t - 1])» 

od 

od 

At each time step (each iteration of the t loop) at every grid point, the value 
of ti(i) is updated by using the data at the three grid points i — 1, t, and 
* + 1 from the previous time step, t — 1. This process is typical of PDE 
computations. Let us apply strip mining and loop interchange to this code. 
The resulting program, which follows, is incorrect. 


for tO = 0 to m step bt do 

for *0 = 1 to n - 1 step bi do 

for t = tO to min(m, tO + bt — 1) do 

for t = *0 to min(n — 1, *0 + bi — 1) do 

u[t,t] = /(u[t- l,t — l],u[i,t- 1],«[*+ M - l]); 
od 
od 
od 
od 


One cannot advance the computation in time for a fixed subset of the grid 
points without advancing it for their neighbors; to update the values at the 
edge of the block of grid points, we require values from neighboring grid 
points outside the block that have not been computed. In other words, 
the loop interchanges that we performed were illegal and, the transformed 
program produces meaningless results. 

Wolfe’s second paper on tiling recognizes this fact. He advocates the use 
of a technique called loop skewing [19]. (This was also discussed by Irigoin 
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and Triolet [11].) By loop skewing, Wolfe means changing the index of the 
inner loop from the natural variable (i above) to the sum or difference of 
the old inner index and an integer multiple of the outer loop index. With 
this transformation, the code above can be changed as follows: 


for t = 0 to m do 

for r = i + lto< + n- 1 do 

u[r - t,t] = /(u[r — t — l,t — l],t*[r — M — l]> tt [ r — * + M — !]); 
od 
od 


Here we have used r — i + 1 as the inner loop index. Note that the inner 
loop now ranges over oblique lines in the (*, t) plane. We may now legally 
strip mine and interchange to get a tiled program: 


for tO = 0 to m step bt do 

for rO = tO + 1 to 10 + n — 1 step br do 
for t = tO to min(m, tO + bt — 1) do 

for r = max(rO,t + 1) to min(t 4- n — l,rO + br — 1) do 
u[r-t,t] = 

f(u[r - 1 - l,t - l],u[r - t,t - l],u[r - f + l,t - 1]); 
od 
od 
od 
od 


Figure 1 shows the tiles of computation in the original coordinates (t,f). 

In this paper, we consider the following generalization of Wolfe’s loop 
skewing. We allow all of the loop indices to be replaced by linear combina- 
tions of the original, natural indices. Let the computation be a loop nest of 
depth k. Let the natural indices be (ij , * 2 > • ■ • , *fc)- Let A be an invertible, 
k x k integer matrix. We would like to use • ■ ■ , Jjt) as the indices in 

a transformed program, where 






We can carry out this transformation in two steps. First , we replace every 
reference to any of the natural loop indices in the program by a reference to 
the equivalent linear combination of the transformed indices. If the rational 
matrix F = [f pq ]p iq =i = A~ T (A~ T denotes the inverse of A T ), then we 
replace a reference to i \ , for example, by the linear combination 


/n * Ji + fl 2 • J2 + • • • + fijk • 3k * 

Second, we compute upper and lower bounds on the transformed indices. 
We call this program rewriting technique bop index transformation . 

The first contribution of this work is a method for choosing the loop in- 
dex transformation A. We start from the assumption that the computation 
is a nested loop of depth k in which there are some loop-carried dependences 
with fixed displacements in the index space. We then consider the problem 
of determining which loop index transformations A permit the resulting 
index-transformed loop nest to be successfully tiled through strip mining 
and interchange. (The mechanics of automating these program transforma- 
tion is discussed in the compiler optimization literature [3].) We show that 
this problem amounts to a purely geometric one: finding a basis for real 
ifc-space consisting of vectors with integer components that are constrained 
to lie in a certain closed, polygonal cone defined by the dependence displace- 
ments. The basis vectors are then taken to be the columns of the loop index 
transformation A. We further show that the amount of reuse that can be 
achieved with a given amount of local memory, which is determined by the 
ratio of the number of iterations in a tile to the amount of data required by 
the tile, is dependent on A in a simple way. It is proportional to the (fc— l) th 
root of det(A ) where A is the matrix obtained by scaling the columns of 
A to have euclidean length one. 

We give a heuristic procedure for determining such an integer matrix A 
that approximately maximizes this determinant. We report on the results 
of some experiments to test its performance and robustness. 

Finally, we consider the optimal choice of tile size and shape, once the 
basis A has been determined. We show that it is straightforward to derive 
block size parameters that maximize the ratio of computation in a tile to 
data required by the tile, given knowledge of the flux of data in the index 
space and the blocking basis A. 
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1.4 Notation 


We use uppercase letters for matrices. The notation X = [xi,X 2 , . . . ,x/t] 
means that X has columns ii,X 2 , • • • > x k- The notation X = [x,j] means 
that X has elements Xij. 

In general, we use lowercase Greek letters for scalars. Let 

1 ifi “ J 
[0 otherwise 

The following symbols have the indicated meaning 


S 

A 

A 

F 

D 

C 

b = m 

a 3 

n 

u 


<f>: 


The index space — the set of values of the loop index vector 

The matrix that transforms natural to new loop indices 

The matrix A with its columns scaled 
to have euclidean length one 

A~ t 

The matrix of dependences 

The matrix of data fluxes 

The ratio of the volume of a tile to its surface area 

The vector of block size parameters 

A normal vector to a tiling hyperplane; one of the columns of A 

A bound on the size of local memory. 

The time required to perform the computation at a point 
in the index space. 

The time required to move data across one unit of area 
in the hyperplane normal to aj. 


We shall make considerable use of determinants. If X = [xi , . . . , x n ] is 
a real, square matrix, then the real- valued function det( A) is the volume of 
the parallelepiped subtended by the columns of X: 

Y^ajXj | 0 < aj < 1 

j = i 

Thus, det(J) = 1. Also det(X) = det(X T ). If Y is also n x n, then 
det(Xy) = det(X)det(y). If T = [t. j] is a triangular matrix, then det(T) = 

(til •<22 , ”tnn)> 
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Let sp{z} denote the one-dimensional subspace spanned by the vector 
z, and let sp{z} -L denote its orthogonal complement. 

Lemma 1 Let xi have length one. Let X\ = [i 2 ,...,x n ]. Let P\ be the 
orthogonal projection matrix on sp{xi}. Then 

det(X) = det(PiXi) . 


Proof: Let c = (c 2 ,...,c Tl ) :r be a A: - 1-vector chosen so that for each 

2 < j < n, Xj — CjX i is orthogonal to x\. Construct the matrix 


-f 1 - cT 'l 
■ ( 0 )' 


Then, since C is triangular and has unit diagonal, det(C) — 1. Since x\ is 
a vector of length one, XC = [xi, Pi-Xi]. Thus, 

det(A) 2 = det^ 3 *) 

= det {C T X T XC) 

det([x l ,P l X 1 ] T {x 1 ,P l Xi}) 

0 

(PiX if(PiXi) 
det(Pi*i) 2 . 


= det 


>1 * ' 
C 


/ 

) 


2 Statement of the Problem 

We are given a convex set of lattice points S € Z*. This is the set of all values 
of the k dimensional natural index i = (*i, * 2 > • • • » *fc) 1° 1°°P nest. We call 

S the index space of the loop nest, which is the standard term, even though S 
is a finite subset of Z k . We are also given a set of dependence displacements 
D = [di, . . . , dm] where each integer vector dj 6 Z fc is the displacement in the 
index space from iterations that produce values to iterations that use them. 
The integer m is the number of such dependences. Hence, for all points 
i e S and for each 1 < j < m, if t - dj £ S , then iteration i - dj must have 
been performed before we perform iteration i. We may also consider anti- 
dependences and output dependences and treat them in the same manner. 
(See [8] for the definition of the various kinds of dependences.) 
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We now consider the blocking problem. The problem is to partition S 

S = Si US 2 U ...US P (1) 

where the subsets of index points {«Sj} are disjoint. The j th tile is the task 
of executing the loop body for all values of the loop index in Sj . 

Some restrictions are in order if this partition of S is to be of any use. 
The key restriction was stated by Wolfe [19]: 

“Each tile is a unit of computation to be scheduled on a pro- 
cessor. Once a tile is scheduled ... it runs to completion without 
preemption. A tile will not be initiated until all dependence con- 
straints for that tile are satisfied, so there will never be a reason 
that a tile, once started, should have to relinquish the processor.” 

We call this the atomicity requirement . 

The second, less fundamental but nevertheless important restriction is 
that the tiling should be expressible as a transformation of the original pro- 
gram. For this reason, we restrict our attention to partitions of S achieved 
by cutting S along hyperplanes. Wolfe’s original tilings used planes normal 
to the natural coordinate axes. Here, we allow arbitrary planes with integer 
normals. If we want to cut up S along hyperplanes normal to the integer 
vector a, we first apply loop index transformation to one of the original 
loops, replacing its index with a T i . We then strip mine this loop and bring 
the strip loop to the outermost position. 

2.1 Definitions 

First, we define the type of partition of S that we are considering. Let an 
integer vector a G Z k and an integer block size parameter 0 be given. The 
partition induced by a and 0 is given by (1) where 

Sj = {t € S | (min a T i) + (j - 1)0 < a T i < (min a T i) + j0} . 
i€«S «€o 

(Imagine a knife aligned so that a is normal to its flat side, cutting S into 
slices of equal thickness 0.) 
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We associate with S and D the dependence graph G = G(S,D ) with 
vertices S and edges 

E = {(t, i') G S x S | 3 a column dj of D 3 i + dj = i'} . 

We assume that G is acyclic. (If the dependence graph comes from a loop 
nest in an imperative language like Fortran, then G has to be acyclic.) 


Definition 1 The set 

C = C(D) = {iGR*| D t x > 0} 

is called the time cone of D. (The inequality is interpreted componentwise .) 

Note that C is an open, convex set closed under multiplication by a positive 
scalar - i.e., C is in fact a cone. It is polygonal, the intersection of the 
half spaces {djx > 0}^. We call C the time cone, without mentioning D, 
whenever there is no ambiguity. 

The closure of C is also important; it is defined by 
C = £(D) = {x 6 R* | D T x > 0}. 


Two subsets of C are important here. First, we must choose, as the 
normals to the hyperplanes used to partition 5, integer vectors in C. The 
intersection of £ with the surface of the unit sphere in R* (with the euclidean 
norm) also plays a role. 

Lemma 2 If C is nonempty, then G(S, D) is acyclic. 

For the proof, observe that the iterations may be performed in order of in- 
creasing value of x T i where x is any vector in C. Because all dependence 
displacements dj make an acute angle with such an x, no dependence con- 
straint is violated. We may therefore interpret x T i as the time at which 
iteration i is to be performed, hence the name we have given C. Points of 
S with equal value of x^i axe independent of one another and can be exe- 
cuted in any order - or in parallel, for that matter. This is the essence of 
Lamport’s hyperplane method for the parallel execution of do- loops [13], 
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Again, if D results from a loop nest in Fortran or a language like it, 
we can show that C is not empty. In fact, it is easy to see that D has the 
property that the first nonzero element of every column is positive (t.e. it is 
lexicographically positive). From this, the nonemptyness of C easily follows. 

We can now show how to choose hyperplanes for partitioning S in such 
a manner that Wolfe’s atomicity requirement is satisfied. First, we restate 
the requirement in terms of the quotient of the dependence graph under the 
partition (1). 

Definition 2 The quotient graph of G = G(S,D ) under the partition (1) 
is the graph with vertices {Si,... ,S,} and edges 

{(S P ,S 9 ) | 3 i p G S p and i q eS q 3 ( i p ,i q ) is an edge of G} . 

The atomicity requirement is equivalent to the requirement that the 
quotient graph be acyclic. A sufficient condition for this is the following. 

Lemma 3 The quotient graph of G(S f D) tinder the partition induced by a 
and /? is acyclic if a £(?. 

For the proof, observe that, by their definition, the subsets of the partition 
induced by a and /? may be ordered according to the values taken by a^ i on 
them. It follows from the definition of C that no point in a lower numbered 
subset can depend on any point in a higher numbered subset; if there were 
such a pair, say a point x that depends on a point y such that x — y = d for 
some column d of D, then d makes an obtuse angle with a, i.e. f a T d < 0, 
since by assumption a? x < a?y. But by definition, aF d > 0 for all columns 
d of D. 

Moldovan and Fortes [15] have used this technique for the synthesis of 
systolic arrays without cyclic data flow, which allows the array to be used 
to solve problems larger than the array. They gave no method for choosing 
the hyperplanes. The material of this section was also presented by Irigoin 
and Triolet [11]. 


15 



2.2 Tiling with Hyperplanes 


From the discussion above, we see that a valid partition of S may be obtained 
by choosing any integer vector in C. The tiles so obtained are slices of the 
index space S. These are not sufficiently small, however, to allow for all 
their necessary data to fit in the local memory of a given computer. 

In terms of the corresponding program, tiling by slicing with a single 
hyperplane can be achieved by a loop reindexing of one loop followed by 
strip mining of that loop (and only that loop) and interchange to make 
the one resulting strip-loop outermost. In the case of matrix multiply, for 
example, this would result in a four-loop program in which the innermost 
three loops do n 2 b operations and use n 2 data. (For, no matter which loop 
we strip mine, one of the matrices is indexed by the two unchanged loop 
indices and so is completely accessed.) 

As the matrix multiply example indicates, we need to be able to strip 
mine all the loops in order to be able to work with tiles whose data sets can 
be made arbitrarily small. In this section we investigate the problem of fully 
tiling loop nests. 

We can state this problem as follows. Given the index space S C Z* 
and the dependence matrix D, choose k linearly independent vectors A = 
. . ,afc] (the columns of A are a basis for k-space) such that each aj g C. 

The partition induced by A and a k-vector of block size parameters 
b is obtained by slicing S into slices of thickness /?i with a knife aligned 
perpendicular to ai , then slicing again with thickness /?2 and with the knife 
rotated so that it is perpendicular to (making long, narrow strips rather 
than slices) and so on, until one has sliced k times, finally obtaining tiles 
that are shaped, except at the boundaries of S, like parallelepipeds whose 
faces are perpendicular to the basis vectors. 

Thus, in order to fully tile a loop nest with arbitrary dependences D , we 
must be able to choose a basis in the cone C. 

Should we be satisfied with any such basis? What if its elements are 
nearly linearly dependent? Then we have tiles that are quite elongated, 
with some very small angles and a low ratio of volume (which measures the 
number of lattice points, or iterations to be performed) to surface area. The 
surface area is a measure of the amount of data that must be moved into 
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local memory in order to carry out the work of a tile. In general, the data 
moved in is the data required because of dependences of iterations in the 
tile on iterations of other tiles. The iterations near the edges require this 
data from outside. 

The ( k — 1) dimensional volume of the tile, which grows like n>=i Pj* 
also a measure of the amount of local memory needed to carry out the work 
of each tile. 

Let us therefore calculate how the choice of A determines the volume- 
to-surface ratio of the induced tiles. We first answer the question for the 
tiling that results when b = (/?, /?, . . . , fi) T . We obtain a formula for the ratio 
when >9=1, then we show how varying /? changes both the ratio and the 
amount of local storage needed. In later sections we consider generalizing 
to tiles with non-unit aspect ratios. 


2.3 Geometric Considerations 

First, we note that if a € C, then so is aa for any positive scalar a. The 
partition induced by A and b is unchanged if we scale the columns of A by 
arbitrary positive amount and scale the corresponding components of b by 
the reciprocal amounts. There is therefore no loss of generality if we replace 
A with A, the matrix obtained by scaling the columns of A to have unit 
euclidean length. 

We first assume that b = (/?,/?,... ,/?) T . Let /? = 1. Then except at the 
boundaries, all tiles are congruent to 

T = {x € K k | 0 < x T aj < 1 , V 1 < j < k} . (2) 

T is a parallelepiped subtended by the columns of the inverse of A . In 
other words, if F = [/i, ...,/*] = , then 

k 

T= {* = !>;/; |0<oy<l}. (3) 

j=l 

To see this, note that fjaj = S k j, so for any x that satisfies (3), x T aj = 
ajfj aj = Qj, and since 0 < a j < 1, equation (2) is satisfied. 

Let V(T) denote the volume of T. Then 

V(T) = \det(F)\ = \det(A)\~ 1 . 
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Let us now consider the faces of T. Without loss of generality, consider the 

face T^subtended by / 2 , , /*- The face is itself a (k - 1) dimensional 

parallelepiped. We want to know its surface area, or in general its ( k — 1) 
dimensional volume, which we denote V(T^). 

Lemma 4 V(T <»>) = |det(F)| = V(T). 

We give a proof, unfortunately algebraic rather than geometric in nature, in 
the Appendix. 

What are the consequences of the lemma? we see that all the faces have 
the same area and that it is equal to the volume of 7”. Thus, the ratio p(7*) 
of the volume to the total surface area of T is just the reciprocal of the 
number of sides, 2k: 

Theorem 5 For any k x k matrix A with unit-length columns, the paral- 
lelepiped T defined by (2) has a ratio of volume to surface area of 

» = **)=-£• 

At first this is surprising, since if A is far from having orthogonal columns 
we would expect a lower ratio. The explanation is that the constant ratio 
has been obtained because the size of T grows as A loses orthogonality. 
(Scaling up the size of any Jb-dimensional object by a factor <f> increases the 
ratio by the factor <f>.) 

The problem we have is to make the ratio p as big as possible subject 
to some limit, p on the tile cross section. This is because, as we shall 
show in detail later (and it is dear intuitively), the cross section of a tile is 
proportional to the amount of local memory needed to execute it. The cross 
section of T is also roughly equal to | det(F)|. To satisfy such a bound, we 
must change the size of T. To keep the problem simple, we shall for the 
present consider rescaling 6 by a constant factor (3. Let fi be chosen so that 
the area of a face, F(T), is exactly p. We have that the volume and area of 
the rescaled tile are 

V(T) = t> k |det(F)| 

and 

F(T) = pl k ~ l '> |det(F)|. 
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Thus, we must choose 


P < (M |det(i4)|) 1/( * 

We can then achieve the ratio of volume to surface area 


Pmax(p 5 A) “ 


2k 


On the other hand, if we wish a ratio p* of volume to surface area, we need 
tiles of dimension /?* = 2k p*. Therefore, we must be able to hold tiles whose 
sides have area 


Mmin — 


(P*) k ~ 1 

1 det(yi)| 

(4) 

(2kp*) k ~ 1 
|det(I)| ’ 

(5) 


We can, because of these observations, now state the optimality problem 
we would like to solve: Given a cone find an integer basis whose elements 
are in the cone. Choose them so that the matrix having the scaled basis vec- 
tors as columns has largest possible determinant. (We call the determinant 
of this scaled matrix the normalized determinant . 

This problem is related to, but is not the same as, choosing A to minimize 
its spectral condition number under the constraints D T A > 0. (See [10] for 
the definition and properties of matrix condition numbers.) Consider the 
vector of singular values of A . The normalization condition places it on the 
unit sphere in R*. The condition number is the ratio of the largest to the 
smallest component; the determinant is the product of the components. In 
the unconstrained case, both are optimized by the vector of equal singular 
values. In the constrained case, however, the optimizing matrices can differ. 

Of course, for general there may be no maximizer among the integer 
bases in the cone. And we do not know whether there is always a maximizing 
choice when £ comes from an integer dependence matrix D. 

We can view this problem as the maximization of the real valued function 
| det(A)| over the k 2 dimensional space of integer matrices A, subject to m 
linear inequality constraints A > 0. It might be fruitful to use a standard 
method for the continuous problem and then convert the solution to integer 
by some rounding-off procedure; we have not pursued this approach. 
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In the next section, we consider a heuristic method for choosing the basis 
A. 

3 A Procedure for Choosing the Tiling Basis 


In this section, we describe a practical procedure for choosing a tiling basis 
A given the dependence matrix D. The procedure’s complexity is a function 
of k, the nesting depth of the loop; m, the number of dependence directions; 
and p, the number of rays of the cone C. (We define what we mean by 
the rays of a polygonal cone below.) In these terms, the complexity of the 
procedure is 0(pk 2 + k 3 + m k ~ x k 2 ). While the exponential term here may 
be cause for some uneasiness, the reader should keep in mind that in the 
practical application of these ideas k will rarely exceed four. 

The procedure can be described as follows: 

1. Construct the set of rays of the cone C. A ray is a vector r £ Z k that 
is on the boundary of C and is at the intersection of at least k — 1 of the half 
spaces { djr = 0). Thus, the rays satisfy 

Dzr = [d<r(i), 2 ), . . . ,<^(*- 1 )] r = 0 (6) 

where E = {a(l), <r(2), . . . , a(k- 1)} is a subset of the integers {1, 2, . . . , m}. 
This is a necessary condition. Let us suppose that there is a unique integer 
solution (up to scaling) of equation (6). For the solution r to be a ray, 
we must check whether D T r > 0. We also check to see whether D^r < 0 
because, if that is the case, then -r is a ray of C. If we find that the rows 
of D selected by E are linearly dependent so that (6) has a two or more 
dimensional subspace of solutions, then we just ignore the set E. 


The method we use for the construction is simply to form all of the 

^ ^ i ^ subsets E and then solve (6) for r. Our implementation uses a 

QR factorization with column pivoting, which is very effective at detecting 
linear dependence of the columns Dz (10]. It is straightforward to find the 
integer solution to (6) by computing a solution in floating-point and then 
finding the smallest scalar multiple that makes the solution integer (after 
perturbations on the order of roundoff error). Implementations that use 
only integer arithmetic would also be feasible and perhaps better. 
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The complexity of these decompositions is 0(fc 3 ). However, we may 
update the QR decomposition after changing one column of the matrix at 
cost 0(1 t 2 ). Bischof has recently shown how to do so and still monitor the 
linear independence of columns [2]. In our experiments, we do not use such 
an updating procedure. 

We must consider the case in which itself has a nontrivial null space, 
which in fact happens quite often. In this case, the set £ is a wedge, 

£ = Af (& £ i 

where Af is the null space of and £\ is the intersection of £ with the 
orthogonal complement of A f, the row space of • To detect this case, we 
always start with a QR factorization of D T itself. This allows us to find 
the rank of D and an integer basis for the null space of D T in a standard 
manner. We then construct the rays of £\ by applying a variant of the 
procedure above to the augmented matrix [D,N], where the columns of N 
are the computed basis for Af* In the variant, the subsets E always include 
all of the columns of N , and enough other columns to make up a set of 
k - 1. The resulting rays must therefore be members of £\\ together with 
the columns of N they are the of rays of C . 

Having obtained the rays as the columns of a matrix R = [rj, J* 2 , . • • , rp]i 
we next choose as our first approximation to the optimal basis a subset of 
these rays. As the cone is invariant under scaling of these rays, we normalize 
them so that their length is one. Then we select a subset of k of them, chosen 
to approximately minimize the condition number of the subset. (We show 
below that this also results in a nearly maximum determinant.) This is 
a standard problem, called subset selection , in statistics. We employ the 
heuristic procedure of Golub, Klema, and Stewart [9], which is described in 
the text of Golub and Van Loan [10]. This procedure involves a singular 
value decomposition (SVD) of R and the QR decomposition with column 
pivoting of a matrix that is part of the SVD, with an overall cost of rk 2 + 6fc 3 
floating-point operations (an operation being a floating-point addition and 
a floating-point multiplication). 

We know of no method for finding the optimal subset of rays other than 

an exhaustive search, at a cost of | ( jQ k 3 flops. The relative costs of our 

implementation and exhaustive search for the optimum subset are illustrated 
in Figures 2 — 4. Obviously, the exhaustive procedure is prohibitively 
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Operation Counts for Selection in 2 Dimensions 



Figure 2: Operation counts versus number of rays for selection in 2 dimen- 
sions. Solid line: Subset selection; Broken line: Exhaustive search. 


22 



xlO 4 Operation Counts for Selection in 3 Dimension i 



Figure 3: Operation counts versus number of rays for selection in 3 dimen- 
sions. Solid line: Subset selection; Broken line: Exhaustive search. 
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xlO 5 Operation Counu for Selection in 4 Dimension) 

3 


Z5 



”5 6 7 8 9 10 11 12 

Number of Rmys 

Figure 4: Operation counts versus number of rays for selection in 4 dimen- 
sions. Solid line: Subset selection; Broken line: Exhaustive search. 

expensive for large problems, but may be used for k — 2, for k = 3 and 
p < 6, and for k = 4 and p < 6. On the other hand, subset selection does 
very well. In a test of 1000 randomly generated 3x6 matrices D, subset 
selection produced a suboptima] choice in 18 cases. In the worst of these, 
the determinant of the basis that it found was 17% smaller than that of the 
optimum basis. 

The basis chosen at this point may be far from optimal. Consider the 



The two rays of the cone are the columns of 



These rays make an angle of 135°; clearly there are orthogonal bases whose 
elements are in C, but not all at the boundaries. To catch cases like this, we 
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have implemented a generalized orthogonalization process. Let angle(x,y) 
denote the angle between the vectors x and y, given by 

angled, y) = mccos • 

The procedure is 


for j = 1 to k do 

Find 1 < t < k such that angle(a;,o,) is maximum; 
if (angle(a,-,aj) > tt/ 2) do 
aj = aj - (af dj / aj aj)ai 

so that aj is orthogonal to a,*; 

Replace aj with an integer vector in C 
of approximately the same direction; 
od 
°d 

if D t A > 0 and the normalized determinant is larger than before 
improvement, accept the new A y else use the old one; 


In a test of 1000 randomly generated 3x6 dependence matrices jD, the 
basis selected by finding the rays of C and then using the subset selection 
procedure above was improved by this procedure. The average determinant 
was improved 14%, from .63 to .71 . In comparisons with several similar pro- 
cedures, this one did the best job of maximizing the normalized determinant. 
We also considered the following variants: 

1. As above, but replace a,* rather than aj after making it orthogonal to 

2. For j = 1 to fc, aj is made orthogonal to each other basis vector with 
which it makes an obtuse angle; this continues until there axe no such 
obtuse angles involving aj. 

3. For every pair of basis vectors a,- and aj with i < j, orthogonalize aj 
and aj by adding a multiple of aj to a,-. 

4. For every pair of basis vectors and aj with i < j, orthogonalize aj 
and a,- by adding a multiple of a* to aj. 
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Procedures 2, 3, and 4 are more costly with little in the way of improved 
performance. Procedure 1 actually does worse. Thus, we recommend the 
use of the procedure above. 


4 Other Applications 


The same technique of tiling loop nests can be used in other contexts, for 
example: 

1. The synthesis of systolic arrays. We may design an array large enough 
to handle a single tile of some given size; the overall computation can be 
performed by the small systolic array regardless of the size of the data, 
by tiling the index space and using the array on the individual tiles. This 
technique was proposed originally by Moldovan and Fortes [15], who did not 
specify how to choose the hyperplanes; we have filled in that gap. 

2. The decomposition of work into tasks that can be executed in par- 
allel on a shared-memory multiprocessor. This technique can find tasks 
of medium to large granularity that require little communication through 
shared memory. It is straightforward to prove that, for sufficiently large 
block sizes, the dependence vectors in the quotient index space are all pos- 
itive. Thus, we may execute tiles simultaneously if the sum of their tile 
indices is equal. This approach is currently being pursued by some manu- 
facturers of shared-memory parallel MIMD machines. This paper enhances 
that technique by allowing for more effective decompositions. 


5 Precise Storage and I/O Requirements 


In this section, we develop formulae that give precise measures of the storage 
required for execution of a tile and of the number of data (input and output 
from local memory) required for execution of a tile. These can be used 
to state more precisely the optimization problem that should be solved in 
determining the tiling basis. 

Consider I/O requirements first. Now, let E be an integer Ixm' matrix 
whose columns represent the data required to satisfy the true dependences 
in an index space. Consider the loop nest 
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for i = 1 to n do 
for j = 1 to n do 

o[i,j] = a[i,j - 1] + b[i,j - 1] ; 

b[i,j] = a[i,j - 1] + b[i - 1 ,j - 1] + c[i] ; 

od 


In this loop nest, the dependences are 



A given iteration requires one datum from the iteration at distance (1 1) T 
and two data from the iteration at distance (0 1) T • Thus, the matrix E is 



In addition to the data computed at other iterations in the index space, 
for which dependence directions have been established, other data may be 
required in order to execute a tile, for example, the c data in the example 
above. We express these data requirements through a second matrix, C. 
Each column of C corresponds to a datum (such as c[i] in the example) 
that is used in common by a number of iterations. It gives the smallest 
displacement in the index space between iterations that use the datum. For 
the example above, 

•-(!)■ 

since all iterations with fixed t use the value c[*J. If, for example, c0[t] were 
used for j = 0,2, . . . , n — 1 and cl[t] were used at iterations j = 1, 3, . . . , n, 
then we would have 


We are now ready to state the I/O required to execute a tile. We as- 
sume that no data are available in local memory to begin with and that all 
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data that may be needed later must be written back from local memory at 
completion of the tile’s execution. Let E = [e r ]?Li- Let C = [c r ]^i. The 
amount of data is given by 


Data(A,6) 


V(T«) + E( 2c ^ a i) 

i=i L \ r=1 


r=l 


£v(T«)(e T «2j;,Cfa i )) . 


Here V (T ^ ) is the volume of the face of the tile normal to the tiling basis 
vector aj, Sj is a normalized tiling basis vector, and e 7 = (1, 1, . . . , 1). That 
this is correct follows from the observation that the grid points at the face 
of a tile depend on values created at iteration points in a “shadow”; the 
shadow points are points not in the tile from which a dependence into the 
tile emanates. For each column d of E the corresponding shadow has as 
its base a face of the tile, say the face normal to flj, and as one of its sides 
the vector whose direction is — d and whose tail is at any corner of the 
face. This shadow has height and base area V(T^) y so it has volume 
V(T^)d T fcj. The factor 2 multiplying E expresses the fact that data that 
are responsible for dependences must be read in and written out, while data 
that are used but not created axe read in but not written back. 

The volume of faces is explained in Section 2.3. 


5.1 Choosing the Ordering of the Block Loops 

A consequence of the requirement that A > 0 is that the block loops 
may appear in any order. Suppose, without loss of generality, that 

e T ([2E,C] T a x ) = ?j!kxe T ([2E,C] T lij ) . 

Then the flux of data per unit surface area across the faces of the tiles 
normal to aj is greater than that across the other faces. We would choose to 
make the j x block loop innermost. This is because we would avoid storing 
to memory the data that flow across the faces normal to fli when going 
from one tile to the next. This has the effect, for example, of causing us 
to choose a “left-looking” block Gaussian elimination or block Householder 
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Q R method in preference to a “right-looking” method, which helps to reduce 
the memory traffic further. (The advantages of this loop ordering have been 
discovered already for several matrix computations [6].) See the examples 
of Section 7 for illustration of how this technique should be applied. 


5.2 Local Memory Requirements 

We will make the simplifying assumption that the same computation, pro- 
ducing and consuming the same number of data, is done at every point of 
S . The memory required to execute a tile depends on the order in which 
the individual points of the tile are executed. For this analysis, we assume 
that the points along hyperplanes normal to a given integer A;- vector r are 
executed simultaneously. We need to store the values produced at earlier it- 
erations that are required by the iterations on this hyperplane. The number 
of such values is again given by the sum of volumes of “shadows” as 

Mem(A,6,r) = [max V(r,t)] (e T ([2E,C] T f)') . 

Here V(r, t) is the volume of the intersection of the hyperplane r T t = t and 
the given tile, i.e., of the set of iteration points computed at time t. The 
maximum is taken over the relevant values of t. 

This largest volume is a function of the tile dimensions and of the shape 
of the tile as well as of the time coordinate r. In general, it can be larger 
than the faces, but by no more than a constant factor of at most 2*" 1 . It 
may be much smaller, as in this case: Let 

“C”) 

and let fa = 1 and fa = 10 so that the tiles are long and narrow and are 
nearly aligned with the t'i axis. The faces of these tiles have lengths of 10 
and about 10.5. If we take r = (0 1) T , then the set of points in the tile 
that are simultaneously executed is small; there are at most two. On the 
other hand, if we choose r = (1 0) r , then there are 10 such points. So our 
earlier assertion that face volume is a good measure of memory required is 
in doubt. 

This is not, however, a real possibility. The example above depends on 
highly elongated tiles. This happens because the basis vectors (the columns 
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of A) are dose, and this in turn is due to a narrow cone <?. But in order 
for r to be used in scheduling as described above, we must have t E C. 
The difficulties described above are assodated with a choice of r nearly 
orthogonal to all of the basis vectors a, which are confined to lie in a narrow 
cone. Such a vector cannot also be in the cone. 


6 Optimal Choice of Block Size 


In this section we present solutions to two important instances of the general 
problem of optimal choice of the block size parameters b — [/^i , • • • , 0k\- l^e 
assume that a set of tiling hyperplane normals A has been chosen and that 
the data fluxes E and C are known, as are the dependences D. The choice 
of the outermost point loop index — r, will also play a role. 

Here our viewpoint is somewhat more realistic than in Section 3. We 
take into account the fact that not all the data required by the execution 
of a tile must be read a priori. Instead, we consider the order in which the 
tiles iterations are processed and assume that the needed data are read (or 
written) at the time they are needed. 

We need some constants to make our estimates precise. Let the amount 
of work per grid point be ui. (The appropriate units for u and the constants 
<f>j that follow are seconds, so that the machine characteristics are included 
through these constants.) Let the flux of data per unit surface area across 
the face normal to aj be <f>j. The way that <f>j depends on E, C , and Oj was 
explained in Section 5. 

First we consider the case k = 2 with the assumption that r is one of the 
two tiling vectors, say r = a\. Then the amount of local memory required 
is proportional to /?2 and is independent of /3j . The total work done is 
and the amount of data referred to is 4>iP\ 4- <t> 2 & 2 ‘ Thus the ratio of 
computation time to memory access time is 

_ u>/?i/?2 ^ ufh 

P <l> lfil + ^ 2^2 4>1 

as /?j — ► oo. (We have redefined the dimensionless parameter p here.) See 
Figure 5. 

In this case, therefore, we always take as large as possible (subject 
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only to the size of the problem being solved) and obtain the ratio shown. 
This ratio is the product of a ratio of work per iteration u> to data per 
iteration fa and the number of grid points fa that fit in the local memory. 
Note in particular that for large problems, for which fa can be taken so 
large that the asymptote is approached, neither the data per unit surface in 
the direction of 02 (that is, <fa) nor the particular choice of tile length in the 
<H direction (that is, fix) plays a role. Similar conclusions are reached if we 
model execution time rather than the computation to communication ratio 
p. Note also that if the ratio fix /fa is larger than ( 5 ?/ fa , then we choose 
t = 02 instead of 01. 

The discussion above is little changed if we allow arbitrary r. What 
matters is that we fix all but one of the block size parameters and allow 
the other to grow, provided that with the given choice of r the memory 
requirement is independent of this one parameter. For that to be true, all 
we need is that T should not be close to 03 rather than the much stronger 
requirement r = Oj. 

Next let us take r = ai and k > 2 . Again, we fix all but one of the block 
size parameters, in this case fa, • • • , and and allow the other one to grow, 
limited only by problem size. 

Let B as Hi fir Memory size places some upper limit on B. Let the 
memory required per unit surface in the hyperplane normal to ax be M . 
Thus, for the given choice of the block size parameters, the local memory 
required is MB f | det(A)|. If the available local memory has room for p data, 
then B is constrained by 

B < p\ det(ji)|/Af . ( 7 ) 

The amount of work per unit distance in the ax direction is uB / | det(A)|. 

Finally, the data required per unit distance in the a\ direction is 



Thus 


W 
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With the constraint on b given by (7), the maximizing choice of b is 
ft = |det(i)|/M) 1 '<*- 1 > 


where $ = Ilj=2^i* 

7 Blocking Examples 

Our first example is an algorithm that uses plane rotations for the QR 
factorization of real m x n matrix X. In the description of these example 
algorithms we suppress all irrelevant detail. To that end, we use the notation 
f(x, y,z) to mean a generic function of the arguments x,y, and z which may 
be a different function at every occurrence. 


( 1 ) 

( 2 ) 


for k = 1 to n do 

for i = m to k + 1 step —1 do 
(c,s) = /(*(<, k),x(i - l,fc)); 
for j = k + 1 to n do 

x(i-l,j) 1 , /[ *(*- hj) 

*(*,;) J 1 \[ x (*>i) 

od 



od 


od 


There are two distinct true dependences here. Statement (2) at iteration 
(i,j,k) depends on statement (2) at iterations (i + 1 ,j,k) (because x(ij) 
is changed there) (t - 1, j,Jb - 1) (because x(i-lj) is changed there). Each 
iteration (i, j, fc) of statment (2) depends on statement (1) at iteration (i t fc), 
so that (0, 1,0) T is a column of C. Furthermore, statement (1) depends on 
statement (2) at iterations (»,fc,fc— 1) and (i-l,k,k-l). Therefore, through 
the uses of c and s, statement (2) depends on itself at iterations (i, k,k — 1) 
and (i- 1, k,k-l); this dependence is weaker that a dependence on iteration 
(i,j - 1, k - 1) and (» - 1 J - 1, k - 1), so if we take these to be the actual 
dependences we are going to be safe. There are also antidependences and 
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output dependences, but these can be ignored for the moment. Thus, 

D = 

and 

C = 

In this case, there are only three rays of the cone, namely, 




After improvement we arrive at the basis 



Thus, the new indices are j, k, and k — i. 

After replacing the index i by r s k — i we have the following program: 


for k = 1 to n do 

for t = k - m to —1 do 

(c(r,fc),s(r,fc)) = f{x{k - r,k),x{k - r - 1,*)); 


f x(fc-r-l,j) 1 _ A 

x(k-T- 1, j) 

H 

1 

1 

x(k - rj ) 


od 


c(r. 


,k),s(r,k)j ; 


od 


od 


Strip mining produces 
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for ArO = 1 to n step b do 

for A: = ArO to min(n, A;0 + b — 1) do 
for rO = kO — m to — 1 step b do 

for r = max(rO, Ar — m) to min(— l,rO + 6 — 1) do 
(c(r y k),s(r,k)) = f(x(k - r y k) y x{k - r - 1,*)); 
for j’O = fcO + 1 to n step b do 

for j = max(fc + 1, jO) to min(n, j'0 + 6 - 1) do 


J — i _ / \ *•* ' _ . 

x(k- T - ij) ] = / (7 *<* 7 r : ^ c(T,k)Xr .*) ) ; 
x(fc-r,j) J x{k-r, j) J / 


Then loop interchanging produces 


for JbO = 1 to n step 6 do 

for rO = fcO — to to —1 step b do 
for jO = fcO to n step b do 

for k = kO to min(n,fcO + 6 — 1) do 

for r = max(rO, k — m) to min(— 1, rO + b — 1) do 

if JO = k 0 then ( c(r,k),s(r,k )) = /(x(fc - r,k),x(k -r- 1,*)); 
for j = max(fc + l,j'0 + 1) to min(n, jO + b — 1) do 


tdlrjf]- ' ([' ]*•*>•*'*>) ; 
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Figure 6: Blocking of the QR factorization of an 20 x 15 matrix with /? 



This rather complicated blocked algorithm works as follows. We illus- 
trate with m = 20, n = 15, b = (5,5,5). Elements of X are eliminated 
by plane rotations in patches, as shown in Figure 6. The values of kO and 
rO at which elements axe eliminated is shown in each patch. The rotations 
used to do the elimination are applied only to columns in the current patch 
(during the block operation with jO = kO). These rotations are stored and 
later applied to columns to the right of the patch (when j 0 > fcO). 

For another example, consider the following program for the QR factor- 
ization which uses Householder transforms rather than plane rotations. In 
this pseudo-code we use the notation x(k : fn,j) to refer to the vector of 
elements [x(Jb, j), x(k + 1 J) . . . , x(m y j) ]. We include it as an example of a 
program that can be blocked without using linear loop index transformation . 


for k = 1 to n do 

s(k) = \\x(k :m,Ar)||; 
x(k,k) = f(x(k,k) y s(k)); 
for j s= k + 1 to n step do 

s'(kj) = f(s(k),x(k : m, k) T x(k : 
x(k : m,j) = x (k : m,j) + s'(k,j) * x(k : m,k)\ 
od 


od 


In Fortran, loops would be triply nested. The compiler, on detecting 
a dependence of some subsequent statement on the whole of an inner loop 
implementing a reduction operation, such as the norm and the inner product 
in the example, should choose to view those loops as atomic and therefore 
work with an index space of reduced dimension. 

The dependences in ( j y k ) space are 


D = 



The basis chosen is the obvious one: A = /. Thus, no skewing is done. 

Now, we choose the order of the block loops. The measure of data flux 
given in Section 5 is the same for &ud for \ so neither order is preferred. 
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Note, however, that the two dependences are different in character. The 
(0, l) r dependence is a true dependence at every point of the index space. 
The other, (l,0) r expresses the dependence of iteration (j,k) on “iteration” 
(it, k) (the task performed outside the inner loop for given k ); the data that 
are required are used in common by all the iterations with fixed k. Thus, for 
the purpose of determining data flux, this dependence direction should be 
given weight 1 (as are columns of C), not 2. If we make this change, the flux 
is greater for <*2» 80 we make the k block loop innermost. This procedure 
yields a left-looking method in which all groups of Householder transforms 
are applied to a block of columns just before that block is triangularized. 
This allows the block to be held in local memory during the application of 
these transforms. 
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Appendix. Proof of Lemma 3. 

Let the Jfc x k — 1 matrices F\ = [/ 2 , • ••,/*] and M = [ 02 ,..., a*]. 
The face is subtended by the columns of F\ . Let the matrix F\ be 
factored 


Ft = QR 

— [Qi 91] 


R 1 
0 


( 8 ) 

(9) 


where Q is an orthogonal matrix, R is an upper triangular matrix, Qi is 
k X * - 1 , and Rt is Jfc - 1 X k - 1; thus Ft = Q\R\ = QR, and q x is a 
unit vector in the direction normal to the range of F\ , which is the span 
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of {ai}. The matrix Pi = QiQf is the orthogonal projector on {ai}- 1 . The 
factorization (9) always exists and is unique up to signs on the diagonal of 
R [10]. 

The columns of Pi are the coordinates of the columns of Pi with re- 
spect to the orthonormal basis (for the subspace of It* in which 7”^ lies) 
consisting of the columns of Q\. Thus 

V(7"W) = | det(Pi)| . 

We must therefore show that |det(Pi)| = |det(F)| = |det(A)| _1 . 

From the identity I = F T A it follows that Pjt— i = Pi T Ai = RfQf A\ ; 
thus 

|det(Fi)| = |det(Q?’ At fl" 1 . 

The proof is complete if we can show that det(g^ A\ ) = det(A). But 
since QtQi = h- i, 

det(gfAi) 2 3 4 = det([ At T QiQt][QiQt ^i ]) 

= det([Pi Ai] t [Pi Ai]). 

The result now follows from Lemma 1. | 
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